A theory-based model of cumulative activity

Energy expenditure can be used to examine the health of individuals and the impact of environmental factors on physical activity. One of the more common ways to quantify energy expenditure is to process accelerometer data into some unit of measurement for this expenditure, such as Actigraph activity counts, and bin those measures into physical activity levels. However, accepted thresholds can vary between demographics, and some units of energy measurement do not currently have agreed-upon thresholds. We present an approach which computes unique thresholds for each individual, using piecewise exponential functions to model the characteristics of their overall physical activity patterns corresponding to the well-established sedentary, light, moderate, and vigorous activity levels from the literature. Models are fit using existing piecewise fitting techniques and software. Most participants’ activity intensity profiles are exceptionally well modeled as piecewise exponential decay. Using this model, we find emergent groupings of participant behavior and categorize individuals into non-vigorous, consistent, moderately active, or extremely active activity intensity profiles. In the supplemental materials, we demonstrate that the parameters of the model correlate with the demographics of age, household size, and level of education, inform behavior change under COVID lockdown, and are reasonably robust to signal frequency.


Appendix A: Changes in Behaviour During COVID-19
The COVID-19 pandemic resulted in a dramatic shift in behaviour for many individuals as lockdowns and other measures went into effect around the world. Studies using questionnaires and self-reported data have been conducted in several countries, including Spain [1], Italy [2], and the United Kingdom [3], and all have shown that physical activity levels have dropped dramatically since pre-pandemic times. A systematic review of studies across multiple populations, using self-report and accelerometer data, showed that physical activity declined during COVID-19 lockdowns in 64 of the 66 studies included in the review [4].
Step count information taken from the fitness app Argus showed that the mean number of daily steps taken dropped by 27.3% within 30 days of the World Health Organization announcement of the COVID-19 pandemic [5].
The first wave of INTERACT data was collected between May 31, 2017 and February 21, 2019 in four Canadian cities: Victoria, Vancouver, Saskatoon, and Montreal. The second wave of this study took place in 2019 and 2020, during the COVID-19 pandemic, in Vancouver, Saskatoon, and Montreal. To see how this event impacted our metrics, we looked only at participants who took part in, and provided sufficient data to, both the first and second waves of data collection. Collecting data during the pandemic was extremely challenging, with participants not wanting to return to the study, resulting in a significantly lower number of participants in the second wave. To compensate for the lower number of participants, we lowered our filtering thresholds from 600 minutes per day for three days to 200 minutes per day for three days for this analysis. A total of 50 participants met these conditions. Figures 1a and 1b show the distribution of all minutes among all common participants for INTERACT waves 1 and 2. Comparing these figures indicates that the peak physical activity of our returning participants is slightly lower in wave 2 than in the first wave of data collection. In wave 2, the distribution in the moderate and vigorous physical activity levels appears to be much flatter and closer to 0 when compared to wave 1. Table 1 compares how the 50 returning participants' profiles change between wave 1 and wave 2, and it indicates that several participants who were somewhat active before the pandemic have become less so.
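The relaxed inclusion rule described above can be sketched as follows. This is a minimal illustration, not the study's actual filtering code: the function name `meets_wear_criteria`, the timestamp-list data layout, and the assumption that every recorded minute counts toward the daily total are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def meets_wear_criteria(timestamps, min_minutes_per_day, min_days=3):
    """Return True if at least `min_days` calendar days each contain
    at least `min_minutes_per_day` recorded minutes.

    Assumption: `timestamps` holds one datetime per recorded minute.
    """
    minutes_per_day = Counter(ts.date() for ts in timestamps)
    valid_days = sum(
        1 for count in minutes_per_day.values() if count >= min_minutes_per_day
    )
    return valid_days >= min_days


# Hypothetical participant: 3 days with 250 recorded minutes each.
start = datetime(2020, 6, 1, 8, 0)
timestamps = [
    start + timedelta(days=day, minutes=minute)
    for day in range(3)
    for minute in range(250)
]

# Passes the relaxed 200-minute rule but not the original 600-minute rule.
passes_relaxed = meets_wear_criteria(timestamps, min_minutes_per_day=200)
passes_original = meets_wear_criteria(timestamps, min_minutes_per_day=600)
```

Under these assumptions, the example participant is retained only under the relaxed threshold, which is the kind of participant the lowered threshold was intended to recover.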

Appendix B: Model Resilience to a Lack of Data
Data records are sometimes lost through physical device failure, lack of participant compliance, or other factors. It is important to know how much of this data can be lost before a participant's data should be excluded from the study. Several studies have approached this problem by imputing missing records, either by assuming the records are lost at random and filling them in from existing records [6], or by using complete records from similar individuals to model the records that would have been produced at the times the records are missing [7]. Studies have also been performed to identify the minimum number of records needed to accurately represent physical activity levels using step counts [8] and accelerometer data [9]. Parametric models explicitly allow interpolation of missing data, as the derived function has continuous support. Our model could be used to replace missing data at the aggregate level. To characterize the degree to which the model could interpolate the data, we performed a downsampling sensitivity analysis.
To observe the effects that a sparse dataset would have on our model, we created new datasets from the original physical activity datasets. A cycle of 1 minute kept, X minutes skipped was used to construct the X-minute level dataset, where the 0-minute level dataset is the original dataset and the values of X examined were 1, 2, 4, 8, 16, 32, and 64. At the 1-minute level only half of the original dataset is retained, at the 2-minute level only one third of the original dataset is retained, and so on. To avoid all participants being removed from the dataset in the filtering stages at higher values of X, and to allow the resulting metrics to be comparable to those of the original datasets, each minute that is skipped in an X-minute level dataset is instead overwritten by the previous minute that was kept. These datasets were constructed for both the INTERACT and NHANES datasets.
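The construction of an X-minute level dataset can be sketched as below. This is an illustrative reading of the description above, assuming the keep/skip cycle starts at the first record of the series; the function names `downsample_hold` and `retained_fraction` are hypothetical.

```python
def downsample_hold(values, skip):
    """Build the X-minute level series (X = `skip`).

    Each cycle keeps 1 minute and skips `skip` minutes; every skipped
    minute is overwritten by the most recent kept minute, so the output
    has the same length as the input.
    """
    if skip < 0:
        raise ValueError("skip must be non-negative")
    period = skip + 1
    # i - (i % period) is the index of the kept minute that starts the cycle.
    return [values[i - (i % period)] for i in range(len(values))]


def retained_fraction(skip):
    """Fraction of original minutes retained at the X-minute level."""
    return 1 / (skip + 1)


# Made-up per-minute activity values for illustration only.
counts = [5, 0, 12, 40, 3, 7, 0, 9]

level_0 = downsample_hold(counts, 0)  # original series, unchanged
level_1 = downsample_hold(counts, 1)  # half of the minutes retained
level_2 = downsample_hold(counts, 2)  # one third of the minutes retained
```

In this toy series the 1-minute level drops the single high value (40), while the 2-minute level happens to keep it and repeat it three times, which illustrates how rare high-intensity minutes are either lost or duplicated depending on where they fall in the cycle.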
In Tables 2 and 3, we see the number of participants with each activity intensity profile in each of the X-minute level datasets, with Table 2 representing the NHANES dataset and Table 3 representing the INTERACT dataset, along with their X-minute level derivatives. Note that the total number of participants may vary between tables as some differences in the datasets may still cause some participants to be filtered out when they were not filtered out in the original dataset.
As seen in these tables, as more data is removed from the dataset, more of the activity intensity profiles become non-vigorous, likely because the data points in the tail are usually the lowest-frequency records for a given participant. As more and more data are removed, it is increasingly likely that the majority of each participant's vigorous activity records are also lost, to the point that the participant has too few data points in the tail and their activity intensity profile becomes non-vigorous. As the number of minutes skipped increases, the number of participants whose profiles are outliers also increases, although as shown in the tables below those participants' profiles eventually become non-vigorous once sufficient data is removed. In particular, we see that the extremely active activity intensity profiles rapidly deteriorate as early as the 1-minute (50%) level.
One notable trait of the NHANES dataset that is not reflected in Table B-1 is that over 1000 participants with a non-vigorous profile in the original dataset have a moderately active profile in the datasets ranging from the 1-minute level to the 4-minute level. This is not reflected in the table because a much greater number of participants with a moderately active profile in the original dataset have a non-vigorous profile in those same datasets. This pattern is essentially nonexistent in the INTERACT data. Figure 2 shows the overall distribution for all participants in both datasets for the original, 1-minute level, 2-minute level, and 4-minute level datasets, as well as a boxplot marking the breakpoints for each participant's resulting model from that dataset, with the INTERACT plots appearing on the left and the NHANES plots appearing on the right. In both datasets, we see the tail decline and the points within the third slope of the model draw closer to 0 as more and more data is removed, which also demonstrates the rapid decline of the extremely active participants and the steady migration of participants to non-vigorous profiles. We see that as the measurement frequency of the dataset decreases, the model increasingly underestimates the time spent performing vigorous activity.
 0    274    55    134    19    31
 1    306    45     96     7    59
 2    328    43     60     1    81
 4    360    17     49     0    87
 8    403     7     33     0    70
16    439     0     21     0    53
32    481     1     14     0    17
64    502     1      2     0     8

Appendix C: Participant Demographic and Activity Intensity Associations
How demographic groups differ in terms of physical activity levels has been the topic of many studies. Marital status has been found to be an indicator of lowered activity levels, particularly among women [10,11]. Studies have found that women in general have lower levels of physical activity when compared to men [11,12,13], that individuals tend to have lower physical activity levels as they get older [12,13], and that education levels appear to correlate with physical activity [11,13]. During the NHANES study, a number of surveys were taken by the participants, including a demographics survey which recorded the participant's age in years, ethnicity, attained education level, and gender, as well as the number of individuals in the participant's household and the annual income of the participant's family. Not all participants in the dataset have a response recorded for every question.
To determine whether our metrics reflect the differences in participant demographics, we used an Analysis of Variance (ANOVA) to test the association between each of these demographics and our metrics. Of the metrics we produce, the activity intensity profile and maximum value appear to have the strongest correlation with the listed demographics. Of these demographics, age, education, and household size are associated with the breakpoint metrics across nearly all of the cases when examining the corresponding p-values. Table 4 shows the ANOVA coefficients for each decade of the age survey question. Table 5 shows the corresponding values for the education levels of adults, while Table 6 shows those values for household sizes.
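As a rough illustration of the kind of test described above, the following sketch computes a one-way ANOVA F statistic for one metric grouped by one demographic variable. It is a simplified, assumed setup rather than the analysis behind Tables 4 to 6: the function name `one_way_anova_f` and the toy values are hypothetical, and in practice a library routine such as SciPy's `scipy.stats.f_oneway` would typically be used, since it also returns the p-value.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F statistic, df_between, df_within) for a one-way ANOVA.

    `groups` is a list of lists, one list of metric values per
    demographic category (for example, one list per age decade).
    """
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)

    ss_between = 0.0  # variation of group means around the grand mean
    ss_within = 0.0   # variation of values around their own group mean
    for group in groups:
        group_mean = mean(group)
        ss_between += len(group) * (group_mean - grand_mean) ** 2
        ss_within += sum((value - group_mean) ** 2 for value in group)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up metric values for three hypothetical demographic categories.
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_between, df_within = one_way_anova_f(toy_groups)
```

A large F statistic relative to the F distribution with `df_between` and `df_within` degrees of freedom corresponds to a small p-value, that is, evidence that the metric differs between categories.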
Looking at Table 4, the most obvious correlation between age and our metrics is with the distribution's maximum value, which steadily decreases with each decade of the participant's life, meaning that the older a person is, the lower their maximum physical activity output. The model's second and third breakpoints follow this trend, while the first breakpoint has the opposite behaviour. This means that, compared to those 0-19 years old, sedentary activity increases with each 10-year age category, while vigorous activity decreases with each 10-year age category. Light and moderate activity increase until age 49, then decrease after age 49, when compared to the 0-19 age group. The first slope, which represents the rate of decay of sedentary activity, appears to increase with age, while the other three slopes do not appear to have a discernible trend when comparing age groups.
Referencing Table 5, we see the maximum activity tends to have a positive correlation with education level. The first slope and the second breakpoint also increase with higher education, while the second slope decreases. The remaining metrics do not appear to have a strictly positive or negative correlation with the individual's level of education.
When we compare the individual's household size to our metrics in Table 6, the maximum increases with the number of household members, while the first slope decreases. The third breakpoint and the second slope increase with household members, but this increase stops around 4 to 5 members, after which the values become relatively constant. The first slope is similar to the previous metrics, but decreases until 4 members. Compared to having one or two people in the household, the second and third breakpoints are greater for households with three or more people. This means that households with more people tend to have higher breakpoints for the separations between light and moderate, and between moderate and vigorous, activity. This could be due to households with more people tending to be younger and likely including children. The remaining slope metrics once again did not appear to correlate with the number of people in the household.
Figure 4 is a line graph showing the age distribution of each of our activity intensity profiles. Each line represents what proportion of the age group has profiles that are non-vigorous, consistent, moderately active, and extremely active. The fraction of non-vigorous activity intensity profiles appears to be steady across all age groups except children, while the consistent and moderately active activity intensity profiles appear to be mostly dominated by children and young adults. The extremely active activity intensity profiles appear to have a large number of middle-aged members. Figure 3a shows how each activity intensity profile breaks down in terms of how many people live in the participant's household, while Figure 3b shows how each profile breaks down in terms of education level.

Appendix D: Weekday vs Weekend Behaviours
Given the differences in routines many individuals have between weekdays and weekends, one would expect the day of the week to influence energy expenditure. It has been shown that adults using the built environment gain different benefits depending on the day of the week [14]. Employees who walk to work have greater physical activity levels than those who drive, but this is not apparent when comparing only weekend physical activity [15]. When examining studies based on children, there are some that conclude children have higher physical activity on weekends than weekdays [16] as well as others which conclude the opposite [17].
To compare how well our metrics differentiate between weekday physical activity and weekend physical activity, both the NHANES and the INTERACT datasets were divided into weekdays and weekends. From each of these four new datasets, we recalculated a set of corresponding metrics using the same process that was used for the original two datasets. Figure 5 shows the distributions of all participants in the INTERACT dataset for their weekday and weekend data, while Figure 6 shows the same for the NHANES dataset. Table 7 shows how many participants have each profile for the weekday and weekend datasets of both INTERACT and NHANES. In both datasets, the number of participants with a non-vigorous profile is much higher on the weekends, with a larger portion of the consistent and moderately active weekday participants being less active on weekends when compared to the number of extremely active participants that are less active on weekends.
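The weekday and weekend split described above can be sketched as follows. This is a minimal illustration that assumes records are (timestamp, value) pairs in local time and that the weekend is Saturday and Sunday; the function name `split_weekday_weekend` is hypothetical.

```python
from datetime import datetime, timedelta


def split_weekday_weekend(records):
    """Split (timestamp, value) records into weekday and weekend lists.

    Assumption: the weekend is Saturday and Sunday in the timestamps'
    own (local) time.
    """
    weekday_records, weekend_records = [], []
    for timestamp, value in records:
        # datetime.weekday(): Monday is 0, Saturday is 5, Sunday is 6.
        if timestamp.weekday() >= 5:
            weekend_records.append((timestamp, value))
        else:
            weekday_records.append((timestamp, value))
    return weekday_records, weekend_records


# One made-up record per day for a week starting Monday, January 1, 2018.
start = datetime(2018, 1, 1)
records = [(start + timedelta(days=day), day) for day in range(7)]

weekday_records, weekend_records = split_weekday_weekend(records)
```

Each of the two resulting subsets would then be passed through the same filtering, binning, and model-fitting steps as the full dataset.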

Appendix E: Computation Runtime
To demonstrate the expected runtime of the model and various steps in its computation, the script was run on a desktop PC with 15 GB RAM and an Intel Core i7-8700 processor, using the NHANES dataset. For each participant, the following steps were performed and timed in order: loading the participant's data file, preprocessing the data, filtering out participants with insufficient data, sorting each minute's data points into data bins, computing the participant's metrics assuming a 4-segment line is used, and computing the participant's metrics assuming a 3-segment line is used. Participants who are filtered out are still included in the data filtering step and all previous steps mentioned above. The median participant runtime values are reported in Table 8. The largest bottleneck in the process was loading the participant's data into memory, followed by data preprocessing, then fitting the 4-segment line model and computing the resulting metrics, then binning the data points, then filtering insufficient data, and finally fitting the 3-segment line model and computing the resulting metrics. All distributions in this section are displayed with a logarithmic y-axis. Figure 7a shows the distribution of all runtimes of all participants for the NHANES dataset, including those that are filtered out, while Figure 7b displays only those participants who are not filtered out at any stage. We clearly see this difference as there are a small number of runtimes well below the median runtime in Fig. 7a that are not present in Fig. 7b.
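A per-step timing harness of the kind described above might look like the following sketch. The step functions here are trivial placeholders, not the actual loading, preprocessing, or fitting code, and the names `time_steps`, `steps`, and `median_runtime` are hypothetical.

```python
import time
from statistics import median


def time_steps(steps, participant_id):
    """Run named steps in order, feeding each output to the next step,
    and return a dict mapping step name to elapsed seconds."""
    timings = {}
    data = participant_id
    for name, func in steps:
        t0 = time.perf_counter()
        data = func(data)
        timings[name] = time.perf_counter() - t0
    return timings


# Placeholder steps standing in for the real pipeline stages.
steps = [
    ("load", lambda pid: list(range(1000))),
    ("preprocess", lambda values: [v for v in values if v % 2 == 0]),
    ("bin", lambda values: sorted(values)),
]

# Time every (hypothetical) participant, then take the per-step median.
per_participant = [time_steps(steps, pid) for pid in range(5)]
median_runtime = {
    name: median(timing[name] for timing in per_participant)
    for name, _ in steps
}
```

Reporting the median rather than the mean keeps the summary robust to the occasional very fast run, such as a participant who is filtered out early.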